Categorization for a global taxonomy

ABSTRACT

Methods and systems are provided for generating training data for training a classifier to assign nodes of a taxonomy graph to items based on item descriptions. Each node has a label. For each item, the system identifies for that item one or more candidate paths within the taxonomy graph that are relevant to that item. The system identifies the candidate paths based on content of the item description of that item matching labels of nodes. A candidate path is a sequence of nodes starting from a root node of the taxonomy graph. For each identified candidate path, the system labels the item description with the candidate path or, equivalently, with the leaf node or the label of the leaf node of that path. The labeled item descriptions compose the training data for training the classifier.

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application claims the benefit of U.S. Provisional Application No. 62/645,106, filed on Mar. 19, 2018, which is incorporated by reference herein in its entirety.

BACKGROUND

E-commerce search systems generally use keyword matching in an attempt to generate items relevant to a query. When a user (or software system) submits a query to a search system, the search system identifies items (e.g., shoes) in a database which have a degree of correlation to the words in the user's query according to some metric. The search system then returns the most relevant items based on that metric as search results. For example, a company may sell apparel items through its e-commerce web site. The web site may describe one item as “red dress” and another item as “red shoes.” If a user of the web site submits a query “red dress shoes,” the search system may determine that each description will match on 2 out of 3 words of the query. Using a metric based on number of word matches, the search system may generate a relevance score of ⅔ for each item. However, based on the user's likely intent to find red shoes that are dress shoes, the “red dress” item is likely to be less relevant than the “red shoes” item. Even the “red shoes” item may not be particularly relevant if the “red shoes” item is a casual shoe rather than a dress shoe.

To improve the relevance of search results, more sophisticated search systems employ concept mapping. A concept map is a hierarchical organization of concepts such as category relations. For example, one category relation may be “clothing/dresses” and another may be “footwear/shoes/dress shoes.” A “red dress” item may be associated with the “clothing/dresses” category relation, and a “red shoes” item may be associated with both the “footwear/shoes/dress shoes” and the “footwear/shoes/casual shoes” category relations. When the initial search results are the “red dress” item and the “red shoes” item, the search system uses category relations to determine that the “red shoes” item is likely more relevant because “red dress shoes” is more similar to the category relation of “footwear/shoes/dress shoes” than the category relation of “clothing/dresses.” Although the search system may include both items in the search results, the “red shoes” item would have a higher ranking.

The hierarchical organization of concepts allows for more general queries and supports suggestions. For example, when a query is “footwear,” the search system may generate search results that include all footwear items including the “red shoes” item. In contrast, when a query is “red shoes” and the category relations include “shoes/dress shoes” and “shoes/casual shoes,” the search system may first identify both categories of “shoes” as relevant. Because “dress shoes” and “casual shoes” are both sub-categories of “shoes,” the search system may ask the user: “Are you interested in red dress shoes or red casual shoes?” When the user answers “dress,” the search system would rank items mapped to “shoes/dress shoes” higher in the search results than items mapped to “shoes/casual shoes.”

Thus, concept mapping allows for associating items, based on their descriptions, with one or more concepts. The concepts for an item may be explicit or implicit in the item's description, although it is preferable that they be both explicit and hierarchical.

A search system may generate concept maps automatically using techniques such as topic models and latent semantic indexing. Different concept maps may be generated depending on the items and their descriptions used to generate the concept maps. Unfortunately, such automatically generated concept maps may bear little relation to the organization of categories that a person would generate.

Rather than generating a concept map, a search system may generate a semantic similarity model. A semantic similarity model is generated by a process of learning implicit relationships between documents and queries based on a large history of queries and documents selected by users. Semantic similarity models do not generate a hierarchy of categories and rely on historical data being available. Semantic similarity models allow for a high degree of automation and can improve search results in many cases, but at the risk of a high error rate when concepts are inaccurately mapped to queries.

A search system may also employ a predefined taxonomical hierarchy or taxonomy graph of categories based on expert knowledge. Defining such a taxonomy graph may require defining thousands of nodes, which can be a very complex and time-consuming task. However, the taxonomy graph need only be created once as it reflects the actual organization of categories as defined by a person to represent a real-world organization of categories. Experts may also define synonyms for concepts to allow matching on terms which are in general use, rather than just the terms used in one specific e-commerce environment. Expert knowledge may be supplemented by automated collection of terms and synonyms, such as by visiting (e.g., automated web crawling) a large number of related e-commerce environments and collecting commonly used terms related to a particular node in the taxonomy graph.

Once a taxonomy graph is defined, nodes may be assigned to items using a classifier that is generated using machine learning based on training data generated manually by an expert. The training data assigns nodes to items. The generating of such training data is both onerous and time-consuming because an expert needs to search through thousands of nodes to find the most relevant node, and multiple nodes may have similar relevance. The classifier generated using training data for one e-commerce environment may not be an effective classifier for another e-commerce environment because item descriptions and terminology will differ from environment to environment. For example, the taxonomy graph for a web site for selling sport apparel may not be useful for a web site selling children's apparel. As a result, a taxonomy graph may need to be generated for each web site, which can be expensive and time-consuming.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a taxonomy graph.

FIG. 2 illustrates a taxonomy definition table defining the taxonomy graph of FIG. 1.

FIG. 3 illustrates a table indicating the paths of the taxonomy graph of FIG. 1.

FIG. 4 illustrates a table containing lookup sets generated from the taxonomy definition table of FIG. 2.

FIG. 5 illustrates an item description.

FIG. 6 is a flow diagram that illustrates high-level processing of a component that assigns a node of the taxonomy graph to an item description in some embodiments.

FIG. 7 is a flow diagram that illustrates high-level processing of a component that generates a classifier for the categorization system in some embodiments.

FIG. 8 is a block diagram illustrating components of the categorization system in some embodiments.

FIG. 9 is a flow diagram that illustrates the processing of a preprocess graph component of the categorization system in some embodiments.

FIG. 10 is a flow diagram that illustrates the processing of a generate lookup sets component of the categorization system in some embodiments.

FIG. 11 is a flow diagram that illustrates the processing of a generate classifier component of the categorization system in some embodiments.

FIG. 12 is a flow diagram that illustrates the processing of a generate training data component of the categorization system in some embodiments.

FIG. 13 is a flow diagram that illustrates the processing of an identify category paths component of the categorization system in some embodiments.

FIGS. 13A and 13B illustrate tables relating to the identification of category paths for the ID of FIG. 5.

FIG. 14 is a flow diagram that illustrates the processing of an identify candidate paths component of the categorization system in some embodiments.

FIG. 15 is a flow diagram that illustrates the processing of an identify tail match paths component of the categorization system in some embodiments.

FIG. 16 is a flow diagram that illustrates the processing of an identify any match paths component of the categorization system in some embodiments.

FIG. 16A illustrates a table containing tail matched paths for text.

FIG. 17 is a flow diagram that illustrates the processing of an identify ancestor match component of the categorization system in some embodiments.

FIG. 17A illustrates a table containing scores for the candidate paths of FIG. 16A.

FIG. 18 is a flow diagram that illustrates the processing of an adjust scores based on rules component of the categorization system in some embodiments.

FIG. 18A illustrates a table indicating the rule matches for candidate paths.

FIG. 19 is a flow diagram that illustrates the processing of a select candidate paths based on word vectors component of the categorization system in some embodiments.

FIG. 20 is a flow diagram that illustrates the processing of an assign candidate paths component of the categorization system in some embodiments.

DETAILED DESCRIPTION

Methods and systems for assigning to an item a node of a taxonomy graph based on an item description of the item as a categorization of the item are provided. In some embodiments, a categorization system inputs an item description and identifies candidate paths of a taxonomy graph that are candidates to be assigned to the item. The categorization system identifies the candidate paths based on relevance of the labels of nodes derived from comparison of content of the item description to the labels of the nodes of the taxonomy graph. When a label is determined to be relevant, the categorization system indicates that the paths from the nodes with that label to the root node of the taxonomy graph are candidate paths and generates a relevance score for each candidate path. The categorization system applies a classifier to the item description to identify classification paths and, for each classification path, a classification score indicating relevance of the classification path. For each candidate path, the categorization system combines the candidate score and the classification score of the corresponding classification path to generate a final score for that candidate path. The categorization system then assigns one or more candidate paths with the highest final scores to the item.

DEFINITION OF TERMINOLOGY

Taxonomy graph: A taxonomy graph is an acyclic hierarchical graph having nodes with labels. Each node, except the root node, has one or more parent nodes. A node may also have a set of associated variants and matching rules.

Label: A label is text that contains one or more elements (e.g., words). For example, a label may be “apparel” or “active tops.” A label is unique in the taxonomy graph.

Path: A path is a node and its ancestor nodes.

Item description: An item description is a collection of text that describes an item. An item description may include fields such as a title field, a category field, and a text field. (The category field in an item description is intrinsic to the environment in which item descriptions are specified, and not directly related to the definition of category given below.)

Variant: A variant of a label is a synonym (i.e., a phrase with one or more elements) or another form of the label. For example, the label “active tops” may have the synonym “active tee” and other forms of “top” and “tops.” A label is a variant of itself. Variants may be specified by a user and generated by algorithmic means. Unlike labels, variants are not unique. Extending the example above, “tops” may be a variant assigned to a node labeled “active tops” and another labeled “pajama tops”, as well as being a label itself.

Lookup set: A lookup set for a variant maps that variant to each path with a leaf node for which that variant is a variant of the label of that leaf node. For example, “top” is a variant associated with a node labeled “active tops” and is therefore mapped to all paths having the leaf node with the label “active tops.”

Tail match: A tail match refers to the longest tail portion of text that matches other text. A tail portion is a sequence of elements of the text that includes the last element of the text. For example, if the text is the label “red dress shoe” and the other text is “dress shoe,” then there is a tail match that is “dress shoe.” In contrast, if the other text is “red dress,” then there is no tail match because “red dress” does not include the last element “shoe.”

Any match: An any match refers to any portion of text that matches the other text. For example, if the text is the label “red dress shoes” and the other text is “red dress,” then there is an any match because the text contains “red dress.” In contrast, if the other text is “slippers,” then there is no any match as “slippers” is not anywhere in the text.

Category: A category is a hierarchical sequence of one or more labels.

Category node: A category node is a node of the taxonomy graph with a label that matches a label of a category.

Candidate path: A candidate path for an item is a path of the taxonomy graph that is a candidate to be assigned to the item.
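
For illustration only, the defined terms can be modeled with simple in-memory structures. The sketch below assumes a hypothetical dictionary-based representation of a taxonomy definition similar to FIG. 2 (the labels, parent lists, and variants shown are examples, not the disclosed taxonomy) and enumerates each path, i.e., a node together with its ancestor nodes.

```python
from typing import Dict, List

# Hypothetical in-memory taxonomy definition: label -> parents and variants.
# Labels are unique; variants are not, and a label is always a variant of itself.
TAXONOMY: Dict[str, dict] = {
    "apparel": {"parents": [], "variants": ["apparel"]},
    "dancewear": {"parents": ["apparel"], "variants": ["dancewear"]},
    "activewear": {"parents": ["apparel"], "variants": ["activewear"]},
    "active tops": {"parents": ["dancewear", "activewear"],
                    "variants": ["active tops", "active top", "tops", "top"]},
}

def paths_to_root(label: str) -> List[List[str]]:
    """Return every path (root ... label) for the node with the given label."""
    parents = TAXONOMY[label]["parents"]
    if not parents:                       # root node: the path is the node itself
        return [[label]]
    paths = []
    for parent in parents:                # one path per distinct chain of ancestors
        for path in paths_to_root(parent):
            paths.append(path + [label])
    return paths

if __name__ == "__main__":
    for path in paths_to_root("active tops"):
        print("/".join(path))             # e.g. apparel/dancewear/active tops
```

As with node 107 of FIG. 1, a node with multiple parents yields multiple paths, which is why a leaf label alone is not a unique path identifier.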

FIG. 1 illustrates a taxonomy graph. Each node 101-116 includes a label. For example, the label for node 107 is “active tops.” The paths that have node 107 as a leaf node are path 101/103/107 and path 101/104/107. In path 101/103/107, nodes 101 and 103 are ancestor nodes. The parent node of a child node represents a higher-level categorization of the child node. For example, parent node 103 of child node 107 represents that “dancewear” is a higher-level categorization of “active tops.”

FIG. 2 illustrates a taxonomy definition table defining the taxonomy graph of FIG. 1. Each row of the table represents a node of the taxonomy graph. For example, row 201 represents node 107. Each row includes the label of the node, the label(s) of the parent node(s), and the variants of the label of the node. For example, row 201 has the label of “active tops”; the variants of “active top”, “tops”, and “top”; and the labels of its parent nodes “apparel” and “dancewear”. The match rules will be described later.

FIG. 3 illustrates a table indicating the paths of the taxonomy graph of FIG. 1. The table contains, for each node, a row for each path with that node as a leaf node. Each row contains the label of the leaf node of the path and a specification of the path with that leaf node using the labels of the nodes of the path. For example, row 301 represents the path 101/104/107, row 302 represents the path 101/103/107, and row 303 represents the path 101/105/108/112. Each row also contains a path identifier. For example, the path identifier of row 302 is 16. Since multiple paths can have the same leaf node (e.g., paths of rows 301 and 302), the label of a leaf node is not a unique identifier of a path.

FIG. 4 illustrates a table containing lookup sets generated from the taxonomy definition table of FIG. 2. The table includes a lookup set for each variant of a label specified in the taxonomy definition table. Each lookup set maps the variant to each path with a leaf node that has a label with that variant. For example, lookup set 401 represents the “active top” variant of the “active tops” label of node 107. Lookup set 401 maps to paths with the path identifiers of 2 and 16, representing paths 101/104/107 and 101/103/107, respectively. The lookup sets are managed in groups according to sequence length (being the number of words in the variant).

FIG. 5 illustrates an item description. The item description includes the title “BEBE Sequin Logo Tee.” The item description includes category sequences such as “athleisure” and “active tops.” Each text entry is delimited by quotations.

FIG. 6 is a flow diagram that illustrates the high-level processing of a component that assigns a path of the taxonomy graph to an item description in some embodiments. The assign node component 600 is provided the item description (“ID”). In block 601, the component generates scores for candidate paths for the ID based on features of nodes matching content of the ID. The features of a node include variants of the label of the node and matching rules associated with the node. A score is a metric that indicates the likelihood that the path is the best path to be assigned to an ID. In decision block 601A, if only one candidate path has a sufficiently high score, then further classification processing is unnecessary and the component continues at block 605, else the component continues at block 602. In block 602, the component applies a classifier to the item description to generate scores for one or more classification paths. A classification path is a path identified by a classifier trained by the categorization system, chosen from the subset of paths for which training data was able to be generated. In block 603, the component sets an overall score for each candidate path, if any, based on the score of that candidate path and the score of its corresponding classification path, if any. In block 604, the component selects the candidate path or paths with an overall score that indicates a likelihood that the candidate path should be assigned to the ID. In block 605, the component assigns the candidate path(s) to the ID.

FIG. 7 is a flow diagram that illustrates high-level processing of a component that generates a classifier for the categorization system in some embodiments. A generate classifier component 700 is provided a collection of IDs and trains the classifier using training data derived from those IDs. In blocks 701-704, the component loops identifying candidate paths for each ID. In block 701, the component selects the next ID. In decision block 702, if all the IDs have already been selected, then the component continues at block 705, else the component continues at block 703. In block 703, the component identifies candidate paths based on labels of nodes of the taxonomy graph matching content of the selected ID. In block 704, the component stores each candidate path, the ID, and optionally a score for the candidate path as the training data. The component then loops to block 701 to select the next ID. In block 705, the component trains the classifier using the training data and stores data learned (referred to as the model) via the training for use when classifying a path. In blocks 706-708, the component loops generating a normalization threshold for the nodes of the taxonomy graph. Since the labels of nodes are not independent and the training data may be unevenly distributed across the nodes, the classification scores may not be comparable between nodes. To account for the classification scores not being comparable, the component generates a normalization threshold for normalizing the classification scores. In block 706, the component selects the next node of the taxonomy graph. In decision block 707, if all the nodes have already been selected, then the component completes, else the component continues at block 708. In block 708, the component generates the normalization threshold for the selected node, stores the normalization threshold for use when classifying a path, and loops to block 706 to select the next node.

FIG. 8 is a block diagram illustrating components of the categorization system in some embodiments. The categorization system 800 includes components 801-812 and data stores 821-823. The preprocess graph component 801 preprocesses a taxonomy graph to generate lookup sets. A generate classifier component 802 receives item descriptions and generates a classifier based on those item descriptions. A generate training data component 803 generates the training data for training the classifier. An identify category paths (“catPaths”) component 804 identifies the category paths for an item description. An identify candidate paths (“canPaths”) component 805 identifies candidate paths for an item description. A select candidate paths based on a word vector (“WV”) component 806 selects candidate paths based on a word vector comparison. An identify tail match path component 807 identifies candidate paths based on matching the tail portion of text. An identify any match path component 808 identifies candidate paths based on matching any portion of text. An identify ancestor matches component 809 identifies an ancestor path of a path that matches a target path. An adjust scores based on rules component 810 applies a set of rules to candidate paths to adjust the scores of the candidate paths. An assign candidate paths component 811 is provided an ID for an item and assigns candidate paths to that item. A generate lookup sets component 812 generates lookup sets for a taxonomy graph. A taxonomy graph store 821 stores the definition of a taxonomy graph. A training data store 822 stores training data for use in training a classifier. The training data includes mappings of IDs to candidate paths. A classifier store 823 stores learned data (e.g., weights) for the classifier generated during the generation of the classifier.

The computing systems (e.g., network nodes or collections of network nodes) on which the categorization system and the other described systems may be implemented may include a central processing unit, input devices, output devices (e.g., display devices and speakers), storage devices (e.g., memory and disk drives), network interfaces, graphics processing units, cellular radio link interfaces, global positioning system devices, and so on. The input devices may include keyboards, pointing devices, touch screens, gesture recognition devices (e.g., for air gestures), head and eye tracking devices, microphones for voice recognition, and so on. The computing systems may include high-performance computing systems, cloud-based servers, desktop computers, laptops, tablets, e-readers, personal digital assistants, smartphones, gaming devices, servers, and so on. For example, the simulations and training may be performed using a high-performance computing system, and the classifications may be performed by a tablet. The computing systems may access computer-readable media that include computer-readable storage media and data transmission media. The computer-readable storage media are tangible storage means that do not include a transitory, propagating signal. Examples of computer-readable storage media include memory such as primary memory, cache memory, and secondary memory (e.g., DVD) and other storage. The computer-readable storage media may have recorded on them or may be encoded with computer-executable instructions or logic that implements the categorization system and the other described systems. The data transmission media are used for transmitting data via transitory, propagating signals or carrier waves (e.g., electromagnetism) via a wired or wireless connection. The computing systems may include a secure cryptoprocessor as part of a central processing unit for generating and securely storing keys and for encrypting and decrypting data using the keys.

The categorization system and the other described systems may be described in the general context of computer-executable instructions, such as program modules and components, executed by one or more computers, processors, or other devices. Generally, program modules or components include routines, programs, objects, data structures, and so on that perform tasks or implement data types of the categorization system and the other described systems. Typically, the functionality of the program modules may be combined or distributed as desired in various examples. Aspects of the categorization system and the other described systems may be implemented in hardware using, for example, an application-specific integrated circuit (“ASIC”) or field programmable gate array (“FPGA”).

FIG. 9 is a flow diagram that illustrates the processing of a preprocess graph component of the categorization system in some embodiments. The preprocess graph component 900 generates lookup sets sorted by number of elements in the variants that are mapped to paths. In block 901, the component identifies the paths of the taxonomy graph. In block 902, the component invokes a generate lookup sets component to generate the lookup sets based on the identified paths. In block 903, the component sorts the lookup sets based on number of elements in the variants and then completes.

FIG. 10 is a flow diagram that illustrates the processing of a generate lookup sets component of the categorization system in some embodiments. The generate lookup sets component 1000 generates a lookup set for each variant of a label of a taxonomy graph. In block 1001, the component selects the next node of the taxonomy graph. In decision block 1002, if all the nodes have already been selected, then the component completes, else the component continues at block 1003. In block 1003, the component selects the next variant of the label of the selected node. In decision block 1004, if all the variants have already been selected, then the component loops to block 1001 to select the next node, else the component continues at block 1005. In block 1005, the component creates (or updates if already created) a lookup set for the selected variant that maps the selected variant to each path that has the selected node as its leaf node. The component then loops to block 1003 to select the next variant. FIG. 4 illustrates the lookup sets for the taxonomy definition table of FIG. 2.
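
A minimal sketch of the generate lookup sets processing, assuming the dictionary-based taxonomy and path representation of the earlier sketch; it maps each variant to the identifiers of the paths whose leaf node carries that variant and groups the lookup sets by sequence length (the number of words in the variant), as described for FIG. 4.

```python
from collections import defaultdict
from typing import Dict, List

def generate_lookup_sets(taxonomy: Dict[str, dict],
                         paths: List[List[str]]) -> Dict[int, Dict[str, List[int]]]:
    """Return {sequence_length: {variant: [path ids]}} for the given paths.

    `paths` is a list of label sequences from the root to a leaf; the list
    index serves as the path identifier.
    """
    groups: Dict[int, Dict[str, List[int]]] = defaultdict(lambda: defaultdict(list))
    for label, node in taxonomy.items():
        # Every path whose leaf node has this label.
        path_ids = [pid for pid, path in enumerate(paths) if path[-1] == label]
        for variant in node["variants"]:
            length = len(variant.split())      # group key: number of words in variant
            groups[length][variant].extend(path_ids)
    return groups
```

Iterating the groups from the longest sequence length to the shortest reproduces the ordering produced by the preprocess graph component of FIG. 9.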

FIG. 11 is a flow diagram that illustrates the processing of a generate classifier component of the categorization system in some embodiments. The generate classifier component 1100 accesses training data, trains the classifier, and generates normalization thresholds. In block 1101, the component selects the next ID-to-candidate-path mapping of the training data. In decision block 1102, if all the mappings have already been selected, then the component continues at block 1104, else the component continues at block 1103. In block 1103, the component trains the classifier based on the selected mapping and then loops to block 1101 to select the next mapping. In block 1104, the component selects the next item description of the training data. In decision block 1105, if all the item descriptions have already been selected, then the component continues at block 1107, else the component continues at block 1106. In block 1106, the component applies the classifier to assign a node to the item description and generates a score for the assignment. The component then loops to block 1104 to select the next item description. In block 1107, the component selects the next node of the taxonomy graph. In decision block 1108, if all the nodes have already been selected, then the component completes, else the component continues at block 1109. In block 1109, the component calculates the classification threshold for the selected node and loops to block 1107 to select the next node.

The classifier model may be trained by iterating through all of the item descriptions and labels stored in the item repository and providing these as inputs to a suitable multi-label classification algorithm such as a Random Forest Classifier. There are numerous classification algorithms in the scientific literature and available in scientific software libraries; any algorithm that is able to perform multi-label classification (that is, to generate potentially more than one label prediction for a given input, where the predicted labels either exceed a defined level of confidence or are provided with a confidence indicator) is suitable. Since the taxonomy paths form a hierarchy, a hierarchical multilabel classification algorithm may be used, or each taxonomy path may be treated as an independent label and a non-hierarchical algorithm such as the Random Forest Classifier may be employed. Non-hierarchical algorithms may be adapted in other ways. (See Cerri, R., Carvalho, A. C., & Freitas, A. A., “Adapting Non-Hierarchical Multilabel Classification Methods for Hierarchical Multilabel Classification,” Intelligent Data Analysis, 15(6), 861-887, 2011.)
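
One possible embodiment of this training step, sketched with scikit-learn (an assumption; the disclosure names a Random Forest Classifier as an example but does not prescribe a library). Each taxonomy path identifier is treated as an independent label, so a non-hierarchical algorithm suffices; the example texts and path identifiers are illustrative only.

```python
# A sketch only: assumes scikit-learn is available and that each taxonomy
# path identifier is treated as an independent label.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Training data: item description text labeled with candidate path ids.
texts = ["bebe sequin logo tee athleisure active tops",
         "toddler cotton pajama top sleepwear"]
path_labels = [[2, 16], [31]]                     # path ids per item (illustrative)

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(path_labels)          # one indicator column per path id

model = make_pipeline(
    TfidfVectorizer(),                            # turn item text into features
    OneVsRestClassifier(RandomForestClassifier(n_estimators=100, random_state=0)),
)
model.fit(texts, y)

# Classification scores per path id for a new item description.
scores = model.predict_proba(["sequin active top"])[0]
for path_id, score in zip(binarizer.classes_, scores):
    print(path_id, round(float(score), 3))
```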

To generate the normalization thresholds for the nodes, the generate classifier component uses the newly trained classifier to generate classification scores for the training data. The component then identifies the classification scores for a node associated with correct classifications and the classification scores for that node associated with incorrect classifications. The component then sets the normalization threshold for the node to the classification score for which the number of correct predictions with classification scores below that classification score is equal to the number of incorrect predictions with classification scores above that classification score. If a classification score is equal to this normalization threshold, the probability of the classification being correct is 0.5. The component scales the original classification scores for a node. The scaled classification score is a probability between 0 and 1 and is generated based on the formula: $p = e^{\lambda(S - S_T)} / (1 + e^{\lambda(S - S_T)})$ where p is the probability, S is the original classification score, S_T is the normalization threshold, and λ is a scaling constant. When λ=4.5, the scaling is such that if S_T=0.5, then p≈0.2 when S=0.2 and p≈0.8 when S=0.8. In this way, when the original classification score represents a probability with a normalization threshold of 0.5, the original classification score is approximately preserved in the resulting scaled classification score.
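
A sketch of this step under the logistic reading of the formula above; the threshold search shown here (balancing correct predictions below against incorrect predictions above) is one simple interpretation of the described procedure, not the only possible one.

```python
import math
from typing import List, Tuple

def normalization_threshold(scored: List[Tuple[float, bool]]) -> float:
    """Pick the score at which correct predictions below it roughly equal
    incorrect predictions above it. `scored` holds (score, was_correct) pairs."""
    best, best_gap = 0.5, float("inf")
    for threshold, _ in scored:
        correct_below = sum(1 for s, ok in scored if ok and s < threshold)
        incorrect_above = sum(1 for s, ok in scored if not ok and s > threshold)
        gap = abs(correct_below - incorrect_above)
        if gap < best_gap:
            best, best_gap = threshold, gap
    return best

def scale_score(score: float, threshold: float, lam: float = 4.5) -> float:
    """Logistic scaling: p = e^(lam*(S - S_T)) / (1 + e^(lam*(S - S_T)))."""
    z = math.exp(lam * (score - threshold))
    return z / (1.0 + z)

# With a threshold of 0.5, raw scores of 0.2 and 0.8 map to roughly 0.2 and 0.8.
print(round(scale_score(0.2, 0.5), 2), round(scale_score(0.8, 0.5), 2))
```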

FIG. 12 is a flow diagram that illustrates the processing of a generate training data component of the categorization system in some embodiments. The generate training data component 1200 is provided IDs and generates the training data using those IDs. In block 1201, the component selects the next ID. In decision block 1202, if all the IDs have already been selected, then the component completes, else the component continues at block 1203. In block 1203, the component normalizes and augments the content of the selected ID. In block 1204, the component invokes an identify category paths component to identify the category paths for the selected ID. In block 1205, the component invokes an identify candidate paths component to identify candidate paths for the selected ID. In decision block 1205A, if only one candidate path was identified, then the component continues at block 1207, else the component continues at block 1206. In block 1206, the component invokes a select candidate path based on word vector component to select a result path for the training data. In some embodiments, multiple result paths may be selected based on their scores having a value above a threshold. In block 1207, the component stores an ID-to-result path mapping as training data and then loops to block 1201 to select the next ID.

The component normalizes the content of an ID using techniques such as converting to lower case, replacing punctuation and symbols with spaces, removing accent marks, and removing redundant whitespace. This normalization helps to reduce the number of elements (i.e., words) with identical meanings. For example, the elements “what's”, “whats”, “Whats” and “What's” are all commonly occurring variations of an element meaning “what is.” These elements may be normalized to “whats”, resulting in one element rather than four elements having the same meaning. The component augments the content of the ID with relevant variants to increase the opportunities for matching. For example, missing plural or singular forms of nouns are added to the list of categories in the ID.
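
A minimal sketch of such normalization, assuming the specific transforms listed above; the exact rules (for example, dropping apostrophes so that “what's” becomes “whats”) are a design choice rather than a prescribed implementation.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Normalize item-description text: lower case, strip accents and
    apostrophes, replace remaining punctuation with spaces, collapse whitespace."""
    text = text.lower()
    # Strip accent marks by decomposing characters and dropping combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.replace("'", "")            # "what's" -> "whats"
    text = re.sub(r"[^\w\s]", " ", text)    # other punctuation/symbols -> spaces
    return re.sub(r"\s+", " ", text).strip()

print(normalize("What's new: BEBE Séquin  Logo Tee!"))  # whats new bebe sequin logo tee
```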

FIG. 13 is a flow diagram that illustrates the processing of an identify category paths component of the categorization system in some embodiments. The identify category paths component is provided an ID and then identifies category paths for the item with that ID. In block 1301, the component selects the next label of the categories of the ID. In decision block 1302, if all the labels have already been selected, then the component continues at block 1304, else the component continues at block 1303. In block 1303, the component invokes the identify tail match paths component to identify candidate paths with leaf nodes that have a label that is a tail match of the selected label. In block 1304, the component designates the leaf nodes of the category paths as category nodes and then completes.

FIGS. 13A and 13B illustrate tables relating to the identification of category paths for the ID of FIG. 5. Row 1310 illustrates that the “brands” category of the ID does not match the label of any nodes of the taxonomy graph. Row 1320 illustrates that the “active tops” category of the ID matches the label of the leaf nodes of paths 2 and 16, which are thus designated as category paths. FIG. 13B lists just the category paths.

FIG. 14 is a flow diagram that illustrates the processing of an identify candidate paths component of the categorization system in some embodiments. The identify candidate paths component 1400 is provided an ID and category paths and identifies candidate paths for that ID. In block 1401, the component sets a base score for the candidate paths. In block 1402, the component invokes an identify tail match paths component passing an indication of the title of the ID. The identify tail match paths component identifies as candidate paths those paths with a leaf node that has a label that is a tail match on the title. In decision block 1403, if a candidate path is found, then the component continues at block 1407, else the component continues at block 1404. In block 1404, the component invokes an identify any match paths component passing an indication of the title of the ID. The identify any match paths component identifies as candidate paths those paths with a leaf node that has a label that is an any match on the title. In block 1405, the component further reduces the base score to reflect that an any match is a less relevant match than a tail match. In decision block 1406, if a candidate path was found, then the component continues at block 1407, else the component completes indicating that no candidate path was found. In block 1407, the component invokes an identify ancestor match component passing an indication of the category paths and the candidate paths to identify and score each candidate path with a sub-path that matches a category path. In block 1408, the component invokes the identify ancestor match component passing an indication of the candidate paths and category paths to identify and score each category path with a sub-path that matches a candidate path. In decision block 1409, if an ancestor path was found, then the component continues at block 1410, else the component continues at block 1411. In block 1410, the component sets the candidate paths to the combination of the ancestor candidate paths and the ancestor category paths. In block 1411, the component sets the score of the candidate paths to a reduced base score. In block 1412, the component invokes an adjust scores based on rules component to adjust the scores of the candidate paths based on rules defined for the nodes of the taxonomy graph and then completes identifying the candidate paths.

FIG. 15 is a flow diagram that illustrates the processing of an identify tail match paths component of the categorization system in some embodiments. The identify tail match paths component 1500 is provided text (e.g., a title) and searches the lookup sets for the longest variant (i.e., the one with the most elements) that matches the longest tail of the text. In block 1501, the component initializes the candidate paths to empty. In block 1502, the component selects the next group in the lookup sets, starting with groups with the longest length that is less than or equal to the length of the text and proceeding to groups with the shortest lengths. In block 1502A, the component sets variable L to the sequence length of the group. In decision block 1503, if all groups have already been selected, then the component completes identifying the candidate paths, else the component continues at block 1504. In decision block 1504, if the last L words of the text appear in the lookup set group, then the component continues at block 1505, else the component loops to block 1502 to select the next group. In block 1505, the component sets the candidate paths to the candidate paths associated with the selected variant that is the last L words of the text and then completes identifying the candidate paths.
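
A sketch of the tail match search, assuming lookup sets grouped by sequence length as in the generate lookup sets sketch above; it returns the paths of the longest variant that matches the tail of the text.

```python
from typing import Dict, List

def tail_match_paths(text: str,
                     groups: Dict[int, Dict[str, List[int]]]) -> List[int]:
    """Return the path ids for the longest variant matching the tail of `text`."""
    words = text.split()
    # Try the longest variant lengths first, but never longer than the text.
    for length in sorted(groups, reverse=True):
        if length > len(words):
            continue
        tail = " ".join(words[-length:])          # last L words of the text
        if tail in groups[length]:
            return groups[length][tail]
    return []                                      # no tail match found
```

For the title “bebe sequin logo tee” this checks “sequin logo tee”, then “logo tee”, then “tee”, mirroring the FIG. 16A example described below.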

FIG. 16 is a flow diagram that illustrates the processing of an identify any match paths component of the categorization system in some embodiments. The identify any match paths component is provided text and searches the lookup sets for the longest variant that matches a portion of the text. In block 1601, the component initializes the candidate paths to empty. In block 1602, the component selects the next group in the lookup sets, starting with groups with the longest length that is less than or equal to the length of the text and proceeding to groups with the shortest lengths. In block 1602A, the component sets variable L to the sequence length of the group. In decision block 1603, if all groups have already been selected, then the component completes identifying the candidate paths, else the component continues at block 1604. In block 1604, the component initializes variable i for selecting portions of the text. In decision block 1605, the component increments variable i and, if variable i is greater than the length of the text minus variable L, then the selected variant is not an any match for the text and the component loops to block 1602 to select the next group, else the component continues at block 1606. In block 1606, if the portion of the text delimited by variables i and i+L−1 is present in the lookup set group, then the component continues at block 1607, else the component loops to block 1605 to select the portion of the text starting with the next element of the text. In block 1607, the component adds to the candidate paths the candidate paths associated with the selected variant that is the portion of the text delimited by variables i and i+L−1 and then loops to block 1602 to select the next group.
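
A companion sketch for the any match search under the same assumptions; it slides a window of each group's sequence length over the text and collects the paths of every matching variant, visiting groups in descending length so longer variants are considered first.

```python
from typing import Dict, List

def any_match_paths(text: str,
                    groups: Dict[int, Dict[str, List[int]]]) -> List[int]:
    """Return path ids for variants matching any portion of `text`."""
    words = text.split()
    matches: List[int] = []
    for length in sorted(groups, reverse=True):    # longest variants first
        if length > len(words):
            continue
        for i in range(len(words) - length + 1):   # every window of `length` words
            window = " ".join(words[i:i + length])
            if window in groups[length]:
                matches.extend(groups[length][window])
    return matches
```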

FIG. 16A illustrates a table containing tail matched paths for text. The text is the title “BEBE sequin logo tee.” The component searches the lookup sets (FIG. 4) for a variant that matches the tail portions of the text, starting with the lookup set group for sequence length 3 words with “sequin logo tee” where it finds no match, then sequence length 2 words with “logo tee” where it also finds no match, and finally sequence length 1 word with “tee” where it does find a match. The table of FIG. 16A lists the paths of the lookup sets of those matching variants.

FIG. 17 is a flow diagram that illustrates the processing of an identify ancestor match component of the categorization system in some embodiments. The ancestor match component 1700 is provided paths X and paths Y and identifies, for each pair of a path X and a path Y, the distance of the ancestor match of path Y, if any. The ancestor match of path Y is a partial path of path Y that matches path X. For example, if path X is A/B/C and path Y is A/B/C/D/E, then path Y has an ancestor match given path X because A/B/C is a partial path of A/B/C/D/E. In contrast, A/C/D/E is not a partial path of A/B/C/D/E. The distance of the ancestor match of path Y is the depth of path Y minus the depth of the partial path. Continuing with the example, the depth of path Y is 5 and the depth of the partial path is 3, therefore the distance of path Y is 2.

In block 1701, the component selects the next path Y. In decision block 1702, if all the paths Y have already been selected, the component continues at block 1709, else the component continues at block 1703. In block 1703, the component selects the next path X. In decision block 1704, if all the paths X have already been selected for the selected path Y, then the component loops to block 1701 to select the next path Y, else the component continues at block 1705. In block 1705, the component determines whether path X is a partial path of path Y. In decision block 1706, if it is a partial path, then the component continues at block 1707, else the component loops to block 1703 to select the next path X. In block 1707, the component records the distance of the ancestor match. In block 1708, the component marks path Y as having an ancestor match and then loops to block 1703 to select the next path X. In block 1709, the component sets the scores for the paths Y with an ancestor match, if any, based on the distances of each path Y and then completes indicating the paths Y with a partial match.

The ancestor match component identifies whether a first path has a partial path that matches a second path, referred to as an ancestor match. If there are one or more ancestor matches, then the score of the first path is set. The component first calculates the distance of the partial match, which is the depth of the first path minus the depth of the second path. For example, if the first path has a depth of 5 and the second path has a depth of 3, then the distance for the first path is 2. The component then generates a multiplier for the base score of the first path. The ancestor matching is performed for multiple second paths. The component sets the multiplier based on the following formula: γ(1 + log(n + ƒ(d))) where γ is a score confidence adjustment due to multiple first paths having ancestor matches (γ=0.9 if multiple first paths have an ancestor match, otherwise γ=1.0), n is the number of second paths found, ƒ(d) is a function of the distances, and log is a logarithmic function (e.g., natural log). The function ƒ can be any of a variety of functions such as the minimum, maximum, or average of the distances. The log function compresses the multiplier given to first paths that have a greater distance and the weight given to having multiple second paths. The formula thus provides a small positive bias to first paths with a larger distance because ancestor matches at more distant nodes (i.e., hypernyms) are more likely to use a different vocabulary and are therefore stronger evidence of relevance. The component gives the first path a score that is the product of the base score and the multiplier for that first path.
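
A sketch of the multiplier calculation, assuming the natural log and a maximum over the distances, as in the FIG. 17A example discussed next.

```python
import math
from typing import List

def ancestor_match_multiplier(distances: List[int],
                              multiple_first_paths: bool) -> float:
    """Multiplier = gamma * (1 + ln(n + f(d))) with f = max, per the formula above."""
    gamma = 0.9 if multiple_first_paths else 1.0
    n = len(distances)                    # number of second paths found
    return gamma * (1.0 + math.log(n + max(distances)))

# Candidate path 3 of FIG. 17A: distances 3, 2, 1 and gamma = 0.9 -> about 2.51.
print(round(ancestor_match_multiplier([3, 2, 1], multiple_first_paths=True), 2))
```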

FIG. 17A illustrates a table containing scores for the candidate paths of FIG. 16A. In this example, the first paths are the candidate paths and the second paths are the category paths. Candidate path 3 has an ancestor match with category paths 0, 1, and 2, with the distances of the partial matches being 3, 2, and 1, respectively. Applying the formula with a maximum function and using γ=0.9, the multiplier for candidate path 3 is 2.51.

FIG. 18 is a flow diagram that illustrates the processing of an adjust scores based on rules component of the categorization system in some embodiments. The adjust scores based on rules component 1800 is provided candidate paths and the ID and adjusts the score of each candidate path based on rules associated with all of the nodes of each candidate path. In block 1801, the component selects the next candidate path. In decision block 1802, if all the candidate paths have already been selected, then the component continues at block 1807, else the component continues at block 1803. In block 1803, the component selects the next rule of a node of the candidate path. In decision block 1804, if all the rules have already been selected, then the component loops to block 1801 to select the next candidate path, else the component continues at block 1805. In decision block 1805, if the field content test of the rule is satisfied for at least one of the fields matching the field name filter, then the component continues at block 1806, else the component loops to block 1803 to select the next rule. In block 1806, the component adjusts the score of the candidate path by the multiplier of the rule and then loops to block 1803 to select the next rule. In block 1807, the component sets the candidate paths to those candidate paths with a score above zero and then completes indicating the candidate paths.

FIG. 18A illustrates a table indicating the rule matches for candidate paths. The candidate paths are the candidate paths of FIG. 17A. Candidate paths 9 and 17 are illustrated as not matching a rule. Candidate path 3 has rules 1820 associated with the node with the label “activewear” as illustrated in FIG. 2. The first rule has a field name filter of “!any” and a field content test of “active|athlet|sport.” The first rule means that if not any of the fields of the ID contain “active,” “athlet,” or “sport,” then the multiplier is 0.0. The second rule means that if any of the fields of the ID contain “active,” “athlet,” or “sport,” then the multiplier is 1.5. The third rule means that if the title or category fields of the ID contain “baby,” “infant,” or “toddler,” then the multiplier is 0.0. The second rule is satisfied because the content (i.e., category and text) includes “active” in the phrase “active top.” As a result, the score 2.51 of candidate path 3 is multiplied by 1.5, resulting in a score of 3.76. If multiple rules apply, a function (e.g., max, product) can be applied to identify the multiplier for the candidate path.
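
A sketch of how such matching rules might be evaluated, assuming a rule encoding of (field name filter, field content test, multiplier) with pipe-separated content terms and a “!” prefix meaning the test must fail on every selected field; this structure is inferred from the example above and is only one possible encoding. The product is used here to combine multiple satisfied rules.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str, float]   # (field name filter, content test, multiplier)

def rule_multiplier(rules: List[Rule], fields: Dict[str, str]) -> float:
    """Combine the multipliers of all satisfied rules (here via a product)."""
    multiplier = 1.0
    for name_filter, content_test, factor in rules:
        negate = name_filter.startswith("!")
        names = name_filter.lstrip("!")
        # "any" selects every field; otherwise restrict to the named fields.
        selected = fields if names == "any" else {
            k: v for k, v in fields.items() if k in names.split("|")}
        terms = content_test.split("|")
        hit = any(term in value for value in selected.values() for term in terms)
        if hit != negate:                 # rule satisfied (or satisfied by absence)
            multiplier *= factor
    return multiplier

fields = {"title": "bebe sequin logo tee",
          "category": "athleisure active tops",
          "text": "sequin logo tee for the gym"}
rules = [("!any", "active|athlet|sport", 0.0),    # zero out if no activewear signal
         ("any", "active|athlet|sport", 1.5),     # boost if activewear signal present
         ("title|category", "baby|infant|toddler", 0.0)]
print(rule_multiplier(rules, fields))             # 1.5 for this item
```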

FIG. 19 is a flow diagram that illustrates the processing of a select candidate paths based on word vectors component of the categorization system in some embodiments. The component 1900 is provided candidate paths and an ID and adjusts the score of the candidate paths based on similarity between a word vector representing the ID and a word vector for each candidate path. In block 1901, the component generates the word vector for the item description. In block 1902, the component selects the next candidate path. In decision block 1903, if all the candidate paths have already been selected, then the component completes, else the component continues at block 1904. In block 1904, the component generates the word vector for the candidate path. In block 1905, the component sets the score of the candidate path to the current score times a similarity factor based on the similarity of the word vectors and then loops to block 1902 to select the next candidate path.

Word vectors are numerical vector representations of words which capture syntactic and semantic regularities in language. An example is that, given vectors for the words “King”, “Man”, and “Woman”, the vector arithmetic King − Man + Woman produces a vector which is similar to the word vector for the word “Queen”. The similarity of word vectors can be determined using a cosine similarity function as represented by the following formula:

$\text{similarity} = \frac{a \cdot b}{\|a\|\,\|b\|}$ where a and b are the word vectors and similarity is the cosine of the angle between the word vectors. Similarity is monotonic with a range of [−1,1], with a similarity of 1 representing the best match and a similarity of −1 representing a diametrically opposed match. The categorization system may use the similarity as the multiplier. If the similarity is 0, the two vectors are orthogonal and are therefore likely to be semantically unrelated. Thus, the categorization system may employ a score threshold greater than 0 for acceptance of candidate paths to ensure with high probability that only relevant candidate paths are retained.

When multiple candidate paths have leaf nodes with the same label, the categorization system weights the labels of ancestor nodes higher than the label of the leaf node. The generating of a vector that increases the weight for each higher-level ancestor node is represented by the following equation:

$a = \omega(N) + \sum\limits_{i} \alpha^{i}\,\omega\left( P_{i} \right)$ where ω is a function that converts a word to a vector, N is the leaf node label, P_i is the ancestor node label at depth i (with the depth index as previously described), and α is the weight multiplier per level, nominally set to 2.5. In this way the labels of higher ancestor nodes have a greater influence on the vector direction. The categorization system creates the function ω as a preprocessing step using algorithms such as that of Gensim. (See Řehůřek, Radim, “Scalability of Semantic Analysis in Natural Language Processing,” Ph.D. Thesis, Faculty of Informatics, Masaryk University, May 2011.) The vector b is the sum of the word vectors for the collection of the words in the content of the ID.
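
A sketch of the word vector comparison, assuming the function ω has already been trained (for example with Gensim) and is available here as a plain word-to-vector mapping; the helper names are hypothetical, and the choice to count depth upward from the leaf so that labels nearer the root receive larger weights is an assumption made to match the statement that higher ancestor nodes have a greater influence.

```python
import numpy as np
from typing import Dict, List

def phrase_vector(phrase: str, omega: Dict[str, np.ndarray], dim: int) -> np.ndarray:
    """Sum the vectors of the words in a phrase (omega maps word -> vector)."""
    vec = np.zeros(dim)
    for word in phrase.split():
        vec += omega.get(word, np.zeros(dim))
    return vec

def path_vector(path_labels: List[str], omega: Dict[str, np.ndarray],
                dim: int, alpha: float = 2.5) -> np.ndarray:
    """a = omega(N) + sum_i alpha^i * omega(P_i) for leaf label N and ancestors P_i."""
    leaf, ancestors = path_labels[-1], path_labels[:-1]
    vec = phrase_vector(leaf, omega, dim)
    # Assumption: depth i counts upward from the leaf, so the root gets the
    # largest weight alpha^i.
    for i, label in enumerate(reversed(ancestors), start=1):
        vec += (alpha ** i) * phrase_vector(label, omega, dim)
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """similarity = (a . b) / (|a| |b|); 1 is a best match, -1 diametrically opposed."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```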

The categorization system generates a word vector based on a word vector model. Prior to performing any categorization, the categorization system builds a word vector model, for example, using a corpus of documents. The collection of IDs themselves may provide a suitable corpus. When the corpus is sufficiently large, the word vector model will produce satisfactory results, even if new items are introduced, as long as the new items use a similar vocabulary to the items on which the word vector model was trained. If items with significant new vocabulary are introduced, then the word vector model may need to be retrained.

FIG. 20 is a flow diagram that illustrates the processing of an assign candidate paths component of the categorization system in some embodiments. The assign candidate paths component 2000 is passed an indication of an ID of an item and identifies candidate paths so that the candidate path(s) with the highest score(s) can be assigned to that item. In block 2001A, the component normalizes and augments the content of the ID. In block 2001B, the component invokes an identify category paths component passing an indication of the ID to identify the category paths. In block 2001C, the component invokes an identify candidate paths component passing an indication of the ID to identify initial candidate paths. In decision block 2002, if one initial candidate path was identified, then the component completes indicating that the initial candidate path is to be assigned to the item, else the component continues at block 2003. In block 2003, the component applies the classifier that was previously trained to the ID to generate alternate candidate paths. In decision block 2004, if the number of initial candidate paths is zero, then the component continues at block 2006, else the component continues at block 2005. In block 2005, the component generates scores for the initial candidate paths. The component augments the set of alternate candidate paths with each initial candidate path that is not an alternate candidate path, giving each a small default score (e.g., 0.01). The component sets the score of each alternate candidate path to a combined score of the scores of that initial candidate path and the corresponding alternate candidate path. The component may set the combined score to a weighted sum of the scores. For example, the combined score may be a weighted average of the scores, with the score of the alternate candidate path having a weight of four times the weight of the initial candidate path. In block 2006, the component invokes an adjust scores based on rules component passing the alternate candidate paths to adjust the scores of the alternate candidate paths based on the rules. In block 2007, the component sets the candidate paths to the alternate candidate paths with a score higher than the threshold score and completes indicating the candidate paths.
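
A sketch of the score combination in block 2005, assuming the 4:1 weighted average mentioned above and the small default score for initial candidate paths that the classifier did not propose; the scores and path identifiers shown are illustrative only.

```python
from typing import Dict

def combine_scores(initial: Dict[int, float],
                   alternate: Dict[int, float],
                   default: float = 0.01,
                   classifier_weight: float = 4.0) -> Dict[int, float]:
    """Weighted average of the classifier (alternate) and initial candidate scores."""
    combined = dict(alternate)
    for path_id, score in initial.items():
        alt = combined.get(path_id, default)       # small default if classifier is silent
        combined[path_id] = (classifier_weight * alt + score) / (classifier_weight + 1)
    return combined

# Path 3 blends both scores; path 12 keeps its classifier score unchanged.
print(combine_scores({3: 0.8}, {3: 0.9, 12: 0.4}))
```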

The following paragraphs describe various embodiments of aspects of the categorization system. An implementation of the categorization system may employ any combination of the embodiments. The processing described below may be performed by a computing system with a processor that executes computer-executable instructions stored on a computer-readable storage medium that implements the categorization system.

In some embodiments, a method performed by a computing system is provided for generating training data for training a classifier to assign nodes of a taxonomy graph to items based on item descriptions. Each node has a label. The method accesses item descriptions of items and the taxonomy graph. For each item, the method identifies for that item one or more candidate paths within the taxonomy graph that are relevant to that item. The candidate paths are identified based on content of the item description of that item matching a feature of the nodes. A candidate path is a sequence of nodes starting from a root node of the taxonomy graph. For each identified candidate path, the method labels the item description with the candidate path. The labeled item descriptions compose the training data. In some embodiments, an item description includes a title and one or more category labels. The identifying of the candidate paths for an item includes identifying one or more candidate paths that have a leaf node with a label that matches a tail portion of the title of the item description for that item. The identifying further includes, for each identified candidate path, generating a relevance score for that candidate path; determining whether the candidate path includes a partial path that is a category path; and upon determining that the candidate path includes a partial path that is a category path, adjusting the relevance score of that candidate path to indicate an increased relevance. A category path is a path of the taxonomy graph with a leaf node that has a label that matches a category label. The method discards candidate paths with a relevance score that does not satisfy a threshold relevance. In some embodiments, the identifying of the candidate paths further comprises, prior to discarding the candidate paths, for each candidate path, determining whether a category path includes a partial path that is the candidate path; and upon determining that a category path includes a partial path that is the candidate path, adjusting the relevance score of that candidate path to indicate an increased relevance. In some embodiments, the method further determines whether no candidate path has a leaf node with a label that matches a tail portion of the title of the item description for that item. Upon determining that no candidate path has such a leaf node, the method identifies one or more candidate paths that have a leaf node with a label that matches any portion of the title of the item description for that item. In some embodiments, an item description includes fields having field names and having field content. Prior to labeling, the method accesses rules associated with a node, each rule having a field name test, a content test, and a relevance adjustment. For each candidate path, for each node along the candidate path, for each rule associated with that node, for each field with a field name that satisfies the field name test of that rule and with field content that satisfies the field content test of that rule, the method adjusts the relevance score of that candidate path based on the relevance adjustment of that rule. In some embodiments, prior to labeling, the method determines whether multiple candidate paths have been identified. Upon determining that multiple candidate paths have been identified, for each candidate path, the method generates a similarity score for that candidate path based on similarity between a word vector for the item description and a word vector for the candidate path. The method discards candidate paths other than the candidate paths with a similarity score above a threshold. In some embodiments, the method trains a classifier using the labeled item descriptions as training data. In some embodiments, for each item, the method applies the classifier to the item description of that item to assign a classification label and a classification score to that item. For each classification label, the method generates a label threshold for the classification label based on the classification scores of items with that classification label. The label threshold is based on the accuracy of the classifier at assigning the classification label.

In some embodiments, a method performed by a computing system isprovided for identifying a path of a taxonomy graph to assign to an itemdescription of an item. The method identifies one or more candidatepaths within the taxonomy graph that are relevant to the item. Themethod identifies the candidate paths based on content of the itemdescription of the item matching labels of nodes. Each candidate pathhas a relevance score. A path is a sequence of nodes starting from aroot node of the taxonomy graph. The method applies a classifier to theitem description to identify classification paths that each has aclassification score. For each candidate path, the method generates afinal relevance score for that candidate path by combining the relevancescore for that candidate path with the classification score for thecorresponding classification path if any. The method assigns to the itemone or more candidate paths with a final relevance score indicating thatthe candidate path is relevant to the item. In some embodiments, an itemdescription includes a title and one or more category labels. The methodidentifies the candidate paths for an item as follows. The methodidentifies one or more candidate paths that have a leaf node with alabel that matches a tail portion of the title of the item descriptionfor that item. For each identified candidate path, the method generatesa relevance score for that candidate path and determines whether thecandidate path includes a partial path that is a category path. Acategory path is a path of the taxonomy graph with a leaf node that hasa label that matches a category label. Upon determining that thecandidate path includes a partial path that is a category path, themethod adjusts the relevance score of that candidate path to indicate anincreased relevance. The method discards candidate paths with arelevance score that does not satisfy a threshold relevance. In someembodiments, the method further, prior to discarding the candidatepaths, for each candidate path, determines whether a category pathincludes a partial path that is the candidate path. Upon determiningthat a category path includes a partial path that is the candidate path,the method adjusts the relevance score of that candidate path toindicate an increased relevance. In some embodiments, the methoddetermines whether no candidate path has a leaf node with a label thatmatches a tail portion of the title of the item description for thatitem. Upon determining that no candidate path has such a leaf node, themethod identifies one or more candidate paths that have a leaf node witha label that matches any portion of the title of the item descriptionfor that item. In some embodiments, an item description includes fieldshaving field names and having field content. The method further, priorto selecting the candidate paths, accesses rules associated with nodes.Each rule has a field name test, a content test, and a relevanceadjustment. For each candidate path, for each node along the candidatepath, for each rule associated with that node, for each field with fieldname satisfies the field name test of that rule and with field contentthat satisfies the field content test of that rule, the method adjuststhe relevance score of that candidate path based on the relevanceadjustment of that rule. 
In some embodiments, for each candidate path, the method adjusts the relevance score of the candidate path based on a normalization threshold for the classification label of that candidate path, the normalization threshold based on accuracy of the classifier at assigning the classification label.
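By way of non-limiting illustration, the following Python sketch shows one possible realization of the candidate-path scoring just described; the tail-match heuristic, the boost amount, the threshold value, and the function names are hypothetical assumptions rather than features of the described method.

    def identify_candidate_paths(item, paths, category_paths, threshold=0.5):
        # paths: candidate paths as tuples of node labels, e.g.
        # ("footwear", "shoes", "dress shoes"); category_paths: paths whose
        # leaf label matches one of the item's category labels.
        title = item["title"].lower()
        candidates = {}
        for path in paths:
            leaf = path[-1].lower()
            if title.endswith(leaf):                    # leaf label matches tail of title
                score = len(leaf) / max(len(title), 1)  # assumed relevance heuristic
                # Increase relevance when the candidate includes a category path
                # as a partial path.
                if any(path[:len(cp)] == cp for cp in category_paths):
                    score += 0.25
                candidates[path] = score
        # Discard candidate paths whose relevance does not satisfy the threshold.
        return {p: s for p, s in candidates.items() if s >= threshold}

    def final_relevance_scores(candidates, classification_scores):
        # Combine each candidate's relevance score with the classifier's score
        # for the corresponding classification path, if any (a simple sum here).
        return {p: s + classification_scores.get(p, 0.0)
                for p, s in candidates.items()}

For example, an item titled "red leather dress shoes" would match the candidate path ("footwear", "shoes", "dress shoes") at the tail of its title, and that path's relevance would be increased if ("footwear", "shoes") were among the item's category paths.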

In some embodiments, one or more computing systems are provided for identifying a path of a taxonomy graph to assign to an item description of an item. The one or more computing systems comprise one or more computer-readable storage mediums storing computer-executable instructions for controlling the one or more computing systems and one or more processors for executing the computer-executable instructions stored in the one or more computer-readable storage mediums. The instructions identify one or more candidate paths within the taxonomy graph that are relevant to the item. Each candidate path has a relevance score. The instructions apply a classifier to the item description to identify classification paths. Each classification path has a classification score. For each candidate path, the instructions generate a final relevance score for that candidate path by combining the relevance score for that candidate path with the classification score for the corresponding classification path. The instructions assign to the item one or more candidate paths with a final relevance score indicating that the candidate path is relevant to the item. In some embodiments, an item description includes a title and one or more category labels, and the instructions that identify the candidate paths for an item identify one or more candidate paths that have a leaf node with a label that matches a tail portion of the title of the item description for that item. For each identified candidate path, the instructions generate a relevance score for that candidate path. The instructions determine whether the candidate path includes a partial path that is a category path. A category path is a path of the taxonomy graph with a leaf node that has a label that matches a category label. Upon determining that the candidate path includes a partial path that is a category path, the instructions adjust the relevance score of that candidate path to indicate an increased relevance. The instructions discard candidate paths with a relevance score that does not satisfy a threshold relevance. In some embodiments, the instructions that identify the candidate paths for an item, prior to discarding the candidate paths, determine, for each candidate path, whether a category path includes a partial path that is the candidate path. Upon determining that a category path includes a partial path that is the candidate path, the instructions adjust the relevance score of that candidate path to indicate an increased relevance. In some embodiments, the instructions further determine whether no candidate path has a leaf node with a label that matches a tail portion of the title of the item description for that item. Upon determining that no candidate path has such a leaf node, the instructions identify one or more candidate paths that have a leaf node with a label that matches any portion of the title of the item description for that item. In some embodiments, an item description includes fields having field names and having field content, and the instructions further, prior to selecting the candidate paths, access rules associated with nodes. Each rule has a field name test, a content test, and a relevance adjustment. For each candidate path, for each node along the candidate path, for each rule associated with that node, and for each field with a field name that satisfies the field name test of that rule and with field content that satisfies the field content test of that rule, the instructions adjust the relevance score of that candidate path based on the relevance adjustment of that rule.
In some embodiments, the instructions, for each candidate path, adjust the relevance score of the candidate path based on a normalization threshold for the classification label of that candidate path. The normalization threshold is based on accuracy of the classifier at assigning the classification label.
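By way of non-limiting illustration, the following Python sketch shows one way node-associated rules with a field name test, a content test, and a relevance adjustment might be applied; the use of regular expressions for the tests and the names of the data structures are hypothetical assumptions, not part of the described embodiments.

    import re
    from dataclasses import dataclass

    @dataclass
    class Rule:
        field_name_test: str   # pattern applied to a field name
        content_test: str      # pattern applied to that field's content
        adjustment: float      # relevance adjustment applied when both tests pass

    def apply_rules(candidate_scores, rules_by_node, item_fields):
        # candidate_scores: {path_tuple: relevance_score}
        # rules_by_node: {node_label: [Rule, ...]}
        # item_fields: {field_name: field_content}
        adjusted = dict(candidate_scores)
        for path in candidate_scores:
            for node in path:                          # each node along the candidate path
                for rule in rules_by_node.get(node, []):
                    for name, content in item_fields.items():
                        if re.search(rule.field_name_test, name) and \
                                re.search(rule.content_test, content):
                            adjusted[path] += rule.adjustment
        return adjusted

For example, a rule associated with a "shoes" node might test for a field named "material" whose content contains "leather" and add a small positive adjustment to any candidate path that passes through that node.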

Although the subject matter has been described in language specific to structural features and/or acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. Accordingly, the invention is not limited except as by the appended claims.

We claim:
1. A method performed by a computing system for generating training data for training a classifier to assign nodes of a taxonomy graph to items based on item descriptions, each node having a label, the method comprising: accessing item descriptions of items and the taxonomy graph, wherein an item description includes a title and one or more category labels; for each item, identifying for that item one or more candidate paths within the taxonomy graph that are relevant to that item, the candidate paths identified based on content of the item description of that item matching a feature of the nodes, a candidate path being a sequence of nodes starting from a root node of the taxonomy graph, wherein the identifying of the candidate paths for an item comprises: identifying one or more candidate paths that have a leaf node with a label that matches a tail portion of the title of the item description for that item; for each identified candidate path, generating a relevance score for that candidate path; determining whether the candidate path includes a partial path that is a category path, a category path being a path of the taxonomy graph with a leaf node that has a label that matches a category label; and upon determining that the candidate path includes a partial path that is a category path, adjusting the relevance score of that candidate path to indicate an increased relevance; and discarding candidate paths with a relevance score that does not satisfy a threshold relevance; and for each identified candidate path, labeling the item description with the candidate path, wherein the labeled item descriptions compose the training data.
2. The method of claim 1 wherein the identifying of the candidate paths further comprises, prior to discarding the candidate paths: for each candidate path, determining whether a category path includes a partial path that is the candidate path; and upon determining that a category path includes a partial path that is the candidate path, adjusting the relevance score of that candidate path to indicate an increased relevance.
3. The method of claim 1 further comprising: determining whether no candidate path has a leaf node with a label that matches a tail portion of the title of the item description for that item; upon determining that no candidate path has such a leaf node, identifying one or more candidate paths that have a leaf node with a label that matches any portion of the title of the item description for that item.
4. The method of claim 1 wherein an item description includes fields having field names and having field content and further comprising, prior to labeling: accessing rules, each rule associated with a node, each rule having a field name test, a content test, and a relevance adjustment; for each candidate path, for each node along the candidate path, for each rule associated with that node, for each field with a field name that satisfies the field name test of that rule and with field content that satisfies the field content test of that rule, adjusting the relevance score of that candidate path based on the relevance adjustment of that rule.
5. The method of claim 1 further comprising, prior to labeling: determining whether multiple candidate paths have been identified; and upon determining that multiple candidate paths have been identified, for each candidate path, generating a similarity score for that candidate path based on similarity between a word vector for the item description and a word vector for the candidate path; and discarding candidate paths other than the candidate paths with the similarity score above a threshold.
6. The method of claim 1 further comprising training a classifier using the labeled item descriptions as training data.
7. The method of claim 6 further comprising: for each item, applying the classifier to the item description of that item to assign a classification label and a classification score to that item; for each classification label, calculating a label threshold for the classification label based on the classification scores of items with that classification label, the label threshold based on accuracy of the classifier at assigning the classification label.
8. A method performed by a computing system for identifying a path of a taxonomy graph to assign to an item description of an item, the method comprising: identifying one or more candidate paths within the taxonomy graph that are relevant to the item, the candidate paths identified based on content of the item description of the item matching labels of nodes, each candidate path having a relevance score, a path being a sequence of nodes starting from a root node of the taxonomy graph; applying a classifier to the item description to identify classification paths, each classification path having a classification score; for each candidate path, generating a final relevance score for that candidate path by combining the relevance score for that candidate path with the classification score for the corresponding classification path if any; and assigning to the item one or more candidate paths with a final relevance score indicating that the candidate path is relevant to the item.
9. The method of claim 8 wherein an item description includes a title and one or more category labels and wherein the identifying of the candidate paths for an item comprises: identifying one or more candidate paths that have a leaf node with a label that matches a tail portion of the title of the item description for that item; for each identified candidate path, generating a relevance score for that candidate path; determining whether the candidate path includes a partial path that is a category path, a category path being a path of the taxonomy graph with a leaf node that has a label that matches a category label; and upon determining that the candidate path includes a partial path that is a category path, adjusting the relevance score of that candidate path to indicate an increased relevance; and discarding candidate paths with a relevance score that does not satisfy a threshold relevance.
10. The method of claim 9 wherein the identifying of the candidate paths further comprises, prior to discarding the candidate paths: for each candidate path, determining whether a category path includes a partial path that is the candidate path; and upon determining that a category path includes a partial path that is the candidate path, adjusting the relevance score of that candidate path to indicate an increased relevance.
11. The method of claim 9 further comprising: determining whether no candidate path has a leaf node with a label that matches a tail portion of the title of the item description for that item; upon determining that no candidate path has such a leaf node, identifying one or more candidate paths that have a leaf node with a label that matches any portion of the title of the item description for that item.
12. The method of claim 9 wherein an item description includes fields having field names and having field content and further comprising, prior to selecting the candidate paths: accessing rules, each rule associated with a node, each rule having a field name test, a content test, and a relevance adjustment; for each candidate path, for each node along the candidate path, for each rule associated with that node, for each field with a field name that satisfies the field name test of that rule and with field content that satisfies the field content test of that rule, adjusting the relevance score of that candidate path based on the relevance adjustment of that rule.
13. The method of claim 8 further comprising, for each candidate path, adjusting the relevance score of the candidate path based on a normalization threshold for the classification label of that candidate path, the normalization threshold based on accuracy of the classifier at assigning the classification label.
 14. One or more computing systems for identifying a path of a taxonomy graph to assign to an item description of an item, the one or more computing systems comprising: one or more computer-readable storage mediums storing computer-executable instructions for controlling the one or more computing systems to: identify one or more candidate paths within the taxonomy graph that are relevant to the item, each candidate path having a relevance score; apply a classifier to the item description to identify classification paths, each classification path having a classification score; for each candidate path, generate a final relevance score for that candidate path by combining the relevance score for that candidate path with the classification score for the corresponding classification path; and assign to the item one or more candidate paths with a final relevance score indicating that the candidate path is relevant to the item; and one or more processors for executing the computer-executable instructions stored in the one or more computer-readable storage mediums.
15. The one or more computing systems of claim 14 wherein an item description includes a title and one or more category labels and wherein the instructions that identify the candidate paths for an item comprise instructions to: identify one or more candidate paths that have a leaf node with a label that matches a tail portion of the title of the item description for that item; for each identified candidate path, generate a relevance score for that candidate path; determine whether the candidate path includes a partial path that is a category path, a category path being a path of the taxonomy graph with a leaf node that has a label that matches a category label; and upon determining that the candidate path includes a partial path that is a category path, adjust the relevance score of that candidate path to indicate an increased relevance; and discard candidate paths with a relevance score that does not satisfy a threshold relevance.
16. The one or more computing systems of claim 15 wherein the instructions that identify the candidate paths for an item comprise instructions to, prior to discarding the candidate paths: for each candidate path, determine whether a category path includes a partial path that is the candidate path; and upon determining that a category path includes a partial path that is the candidate path, adjust the relevance score of that candidate path to indicate an increased relevance.
17. The one or more computing systems of claim 15 wherein the instructions further comprise instructions to: determine whether no candidate path has a leaf node with a label that matches a tail portion of the title of the item description for that item; upon determining that no candidate path has such a leaf node, identify one or more candidate paths that have a leaf node with a label that matches any portion of the title of the item description for that item.
18. The one or more computing systems of claim 15 wherein an item description includes fields having field names and having field content and wherein the instructions further comprise instructions to, prior to selecting the candidate paths: access rules, each rule associated with a node, each rule having a field name test, a content test, and a relevance adjustment; for each candidate path, for each node along the candidate path, for each rule associated with that node, for each field with a field name that satisfies the field name test of that rule and with field content that satisfies the field content test of that rule, adjust the relevance score of that candidate path based on the relevance adjustment of that rule.
19. The one or more computing systems of claim 14 wherein the instructions further comprise instructions to, for each candidate path, adjust the relevance score of the candidate path based on a normalization threshold for the classification label of that candidate path, the normalization threshold based on accuracy of the classifier at assigning the classification label.